Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they help fight inflation, create wealth, and provide some tax benefits. Steady returns compounded over a long period can also grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to judge the worth of a stock, and repeating that analysis across a multitude of stocks to identify the right picks for an individual can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics as well as stocks that exhibit minimal correlation with each other. This helps investors analyze stocks across different market segments and protects against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
From time to time, you will find the %%time command included in some cells. It is there because I was interested in knowing the execution time of some cells, especially the cells that run complex commands. The command is not critical to the final outcome, but it is interesting to know.
For the bar plots, I printed the totals and percentages in addition to the plot itself, as I like to see the exact numbers.
I am not the world's greatest speller so there will be spelling mistakes. Jupyter really needs a spell checker.
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To suppress scientific notation in a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
I am old fashioned and was originally trained in procedural code. I like defining functions at the top of the notebook. I also find it is better to have everything defined upfront so you can define and load everything all at once. The flow of the code is also not interrupted by a function definition.
# code to enable the printing of headers in bold and maybe colors
class color:
    PURPLE = "\033[95m"
    CYAN = "\033[96m"
    DARKCYAN = "\033[36m"
    BLUE = "\033[94m"
    GREEN = "\033[92m"
    YELLOW = "\033[93m"
    RED = "\033[91m"
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"
    END = "\033[0m"
# print(color.BOLD + 'Hello World !' + color.END)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    # histogram; the kde argument is now passed through instead of being hardcoded
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is True)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar centre
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot
# function to get a value count and percentage for a column in a dataset
def count_percentage(data, contents):
    """
    To get a value count on a column and its percentage

    data: dataframe
    contents: column to be counted
    """
    function_df = pd.DataFrame(columns=[contents, "Percentage"])
    function_df[contents] = data[contents].value_counts()
    function_df["Percentage"] = data[contents].value_counts(normalize=True)
    print(contents, "with percentage of total \n")
    print(function_df)
    del function_df  # free memory
# Function to display values on seaborn barplot
def show_values(axs, orient="v", space=0.01):
    """
    Annotate each bar of a seaborn barplot with its value

    axs: a matplotlib Axes or an array of Axes
    orient: "v" for vertical bars, "h" for horizontal bars
    space: padding between the end of a horizontal bar and its label
    """
    def _single(ax):
        if orient == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height() + (p.get_height() * 0.01)
                value = "{:.1f}".format(p.get_height())
                ax.text(_x, _y, value, ha="center")
        elif orient == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height() - (p.get_height() * 0.5)
                value = "{:.1f}".format(p.get_width())
                ax.text(_x, _y, value, ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _single(ax)
    else:
        _single(axs)
# read the data
df = pd.read_csv(
"C:/Users/David/LibreOffice_Personal/DataScience/TestData_Files/stock_data.csv"
)
# Ran code on local machine using Anaconda
df.sample(10)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 233 | NWL | Newell Brands | Consumer Discretionary | Housewares & Specialties | 44.080 | 9.980 | 1.641 | 19 | 14 | 75400000 | 350000000 | 1.300 | 269230769.200 | 33.908 | -2.076 |
| 129 | FIS | Fidelity National Information Services | Information Technology | Internet Software & Services | 60.600 | -10.554 | 1.148 | 7 | 29 | 194800000 | 650800000 | 2.220 | 293153153.200 | 27.297 | -19.087 |
| 5 | ADM | Archer-Daniels-Midland Co | Consumer Staples | Agricultural Products | 36.680 | -12.017 | 1.516 | 10 | 49 | -189000000 | 1849000000 | 2.990 | 618394648.800 | 12.268 | 7.497 |
| 330 | XL | XL Capital | Financials | Property & Casualty Insurance | 39.180 | 7.697 | 0.991 | 10 | 99 | 734422000 | 1201560000 | 4.220 | 284729857.800 | 9.284 | -7.763 |
| 140 | GILD | Gilead Sciences | Health Care | Biotechnology | 101.190 | 2.689 | 1.494 | 98 | 148 | 2824000000 | 18108000000 | 12.370 | 1463864188.000 | 8.180 | 3.159 |
| 277 | SPG | Simon Property Group Inc | Real Estate | REITs | 194.440 | 5.285 | 1.136 | 48 | 47 | 88852000 | 2139375000 | 5.880 | 363839285.700 | 33.068 | -1.269 |
| 83 | CVX | Chevron Corp. | Energy | Integrated Oil & Gas | 89.960 | 12.845 | 1.751 | 3 | 43 | -1763000000 | 4587000000 | 2.460 | 1864634146.000 | 36.569 | 4.764 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480 | 7.067 | 1.581 | 12 | 67 | 412000000 | 968100000 | 2.260 | 428362831.900 | 14.814 | 4.219 |
| 224 | NEE | NextEra Energy | Utilities | MultiUtilities | 103.890 | 6.238 | 1.023 | 12 | 6 | -6000000 | 2752000000 | 6.110 | 450409165.300 | 17.003 | -7.353 |
| 27 | AN | AutoNation Inc | Consumer Discretionary | Specialty Stores | 59.660 | 2.350 | 1.481 | 19 | 1 | -1300000 | 442600000 | 3.930 | 112620865.100 | 15.181 | -7.970 |
print("There are", df.shape[0], "rows and", df.shape[1], "columns.")
There are 340 rows and 15 columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
print("The number of null values for each variable is:")
df.isnull().sum()
The number of null values for each variable is:
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
print("The number of duplicated rows are: ", df.duplicated().sum())
The number of duplicated rows are: 0
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.000 | 80.862 | 98.055 | 4.500 | 38.555 | 59.705 | 92.880 | 1274.950 |
| Price Change | 340.000 | 4.078 | 12.006 | -47.130 | -0.939 | 4.820 | 10.695 | 55.052 |
| Volatility | 340.000 | 1.526 | 0.592 | 0.733 | 1.135 | 1.386 | 1.696 | 4.580 |
| ROE | 340.000 | 39.597 | 96.548 | 1.000 | 9.750 | 15.000 | 27.000 | 917.000 |
| Cash Ratio | 340.000 | 70.024 | 90.421 | 0.000 | 18.000 | 47.000 | 99.000 | 958.000 |
| Net Cash Flow | 340.000 | 55537620.588 | 1946365312.176 | -11208000000.000 | -193906500.000 | 2098000.000 | 169810750.000 | 20764000000.000 |
| Net Income | 340.000 | 1494384602.941 | 3940150279.328 | -23528000000.000 | 352301250.000 | 707336000.000 | 1899000000.000 | 24442000000.000 |
| Earnings Per Share | 340.000 | 2.777 | 6.588 | -61.200 | 1.558 | 2.895 | 4.620 | 50.090 |
| Estimated Shares Outstanding | 340.000 | 577028337.754 | 845849595.418 | 27672156.860 | 158848216.100 | 309675137.800 | 573117457.325 | 6159292035.000 |
| P/E Ratio | 340.000 | 32.613 | 44.349 | 2.935 | 15.045 | 20.820 | 31.765 | 528.039 |
| P/B Ratio | 340.000 | -1.718 | 13.967 | -76.119 | -4.352 | -1.067 | 3.917 | 129.065 |
# Setting cols_list as a constant as it will be used in a number of cells
cols_list = df.select_dtypes(include=np.number).columns.tolist()
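As a quick sanity check of what `select_dtypes(include=np.number)` returns, here is a minimal sketch on a toy frame (the column names are made up for illustration; the notebook's real frame is `df`):

```python
import numpy as np
import pandas as pd

# Toy frame mixing numeric and object columns (illustrative data only)
toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],   # object dtype -> excluded
    "Price": [10.5, 20.0],      # float64 -> included
    "Volume": [1000, 2500],     # int64 -> included
})

numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()
print(numeric_cols)  # ['Price', 'Volume']
```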
========================================================================================================================
Questions:
%%time
print(color.BOLD + 'The numerical columns in the data are:' + color.END)
for x in range(len(cols_list)):
print('\t', cols_list[x])
print('\n')
for feature in cols_list:
histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None)
The numerical columns in the data are:
Current Price
Price Change
Volatility
ROE
Cash Ratio
Net Cash Flow
Net Income
Earnings Per Share
Estimated Shares Outstanding
P/E Ratio
P/B Ratio
Wall time: 1.12 s
The metrics are roughly evenly split between those that are right skewed and those that are relatively normally distributed.
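The skewness observation can be quantified with `DataFrame.skew()`. Since the stock data lives on a local path, this sketch uses synthetic stand-in columns (one symmetric, one right skewed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-ins for the real metrics (illustrative only)
sample = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 5000),       # roughly normal
    "right_skewed": rng.lognormal(0, 1, 5000), # long right tail
})

skews = sample.skew()
print(skews)
# A skew near 0 suggests symmetry; a large positive value indicates right skew
```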
%%time
count_percentage(df, "GICS Sector")
print("-" * 50)
print("\n")
labeled_barplot(df, "GICS Sector")
GICS Sector with percentage of total
GICS Sector Percentage
Industrials 53 0.156
Financials 49 0.144
Health Care 40 0.118
Consumer Discretionary 40 0.118
Information Technology 33 0.097
Energy 30 0.088
Real Estate 27 0.079
Utilities 24 0.071
Materials 20 0.059
Consumer Staples 19 0.056
Telecommunications Services 5 0.015
--------------------------------------------------
Wall time: 358 ms
The sectors with the biggest growth are:
The sector with the least growth is:
%%time
count_percentage(df, "GICS Sub Industry")
print("-" * 50)
print("\n")
labeled_barplot(df, "GICS Sub Industry")
GICS Sub Industry with percentage of total
GICS Sub Industry Percentage
Oil & Gas Exploration & Production 16 0.047
REITs 14 0.041
Industrial Conglomerates 14 0.041
Electric Utilities 12 0.035
Internet Software & Services 12 0.035
Health Care Equipment 11 0.032
MultiUtilities 11 0.032
Banks 10 0.029
Property & Casualty Insurance 8 0.024
Diversified Financial Services 7 0.021
Biotechnology 7 0.021
Pharmaceuticals 6 0.018
Packaged Foods & Meats 6 0.018
Oil & Gas Refining & Marketing & Transportation 6 0.018
Semiconductors 6 0.018
Diversified Chemicals 5 0.015
Consumer Finance 5 0.015
Integrated Oil & Gas 5 0.015
Industrial Machinery 5 0.015
Managed Health Care 5 0.015
Airlines 5 0.015
Health Care Facilities 5 0.015
Internet & Direct Marketing Retail 4 0.012
Residential REITs 4 0.012
Retail REITs 4 0.012
Soft Drinks 4 0.012
Research & Consulting Services 4 0.012
Asset Management & Custody Banks 4 0.012
Specialty Chemicals 4 0.012
Railroads 4 0.012
Aerospace & Defense 4 0.012
Integrated Telecommunications Services 4 0.012
Building Products 4 0.012
Hotels, Resorts & Cruise Lines 4 0.012
Restaurants 3 0.009
Cable & Satellite 3 0.009
Air Freight & Logistics 3 0.009
Household Products 3 0.009
Regional Banks 3 0.009
IT Consulting & Other Services 3 0.009
Construction & Farm Machinery & Heavy Trucks 3 0.009
Life & Health Insurance 3 0.009
Health Care Distributors 3 0.009
Specialized REITs 3 0.009
Oil & Gas Equipment & Services 3 0.009
Insurance Brokers 3 0.009
Specialty Stores 3 0.009
Fertilizers & Agricultural Chemicals 2 0.006
Tobacco 2 0.006
Leisure Products 2 0.006
Advertising 2 0.006
Data Processing & Outsourced Services 2 0.006
Paper Packaging 2 0.006
Construction Materials 2 0.006
Application Software 2 0.006
Homebuilding 2 0.006
Automobile Manufacturers 2 0.006
Investment Banking & Brokerage 2 0.006
Broadcasting & Cable TV 2 0.006
Auto Parts & Equipment 2 0.006
Health Care Supplies 2 0.006
Electronic Components 2 0.006
Gold 1 0.003
Environmental Services 1 0.003
Home Furnishings 1 0.003
Household Appliances 1 0.003
Apparel, Accessories & Luxury Goods 1 0.003
Specialty Retail 1 0.003
Life Sciences Tools & Services 1 0.003
Publishing 1 0.003
Human Resource & Employment Services 1 0.003
Steel 1 0.003
Housewares & Specialties 1 0.003
Thrifts & Mortgage Finance 1 0.003
Brewers 1 0.003
Technology, Hardware, Software and Supplies 1 0.003
Personal Products 1 0.003
Industrial Gases 1 0.003
Office REITs 1 0.003
Multi-Sector Holdings 1 0.003
Alternative Carriers 1 0.003
Computer Hardware 1 0.003
Distributors 1 0.003
Agricultural Products 1 0.003
Metal & Glass Containers 1 0.003
Financial Exchanges & Data 1 0.003
Water Utilities 1 0.003
Home Entertainment Software 1 0.003
Drug Retail 1 0.003
Electrical Components & Equipment 1 0.003
Semiconductor Equipment 1 0.003
Multi-line Insurance 1 0.003
Copper 1 0.003
Electronic Equipment & Instruments 1 0.003
Diversified Commercial Services 1 0.003
Consumer Electronics 1 0.003
Tires & Rubber 1 0.003
Industrial Materials 1 0.003
Motorcycle Manufacturers 1 0.003
Technology Hardware, Storage & Peripherals 1 0.003
Real Estate Services 1 0.003
Trucking 1 0.003
Networking Equipment 1 0.003
Casinos & Gaming 1 0.003
--------------------------------------------------
Wall time: 5.28 s
The subsectors with the largest growth are: <br>
* Note: largest growth is defined as Growth >= 0.02
The subsectors with middle level growth are: <br>
* Note: middle growth is defined as Growth >= 0.01 and < 0.02
All other subsectors are low growth. <br>
* Low growth is defined as Growth < 0.01
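The three tiers can be made explicit with `pd.cut`, using 0.01 and 0.02 as the share-of-total cutoffs. The sub-industry names and values below are a small illustrative subset of the table above:

```python
import pandas as pd

# Share of total for a handful of sub-industries (values from the table above)
shares = pd.Series(
    [0.047, 0.021, 0.015, 0.009, 0.003],
    index=["Oil & Gas E&P", "Biotechnology", "Airlines", "Restaurants", "Gold"],
)

# Tier each sub-industry; right=False makes the bins [0, 0.01), [0.01, 0.02), [0.02, 1.0)
tiers = pd.cut(shares, bins=[0, 0.01, 0.02, 1.0],
               labels=["low", "middle", "large"], right=False)
print(tiers)
```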
%%time
plt.figure(figsize=(20, 12))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Wall time: 483 ms
These variables have a high correlation:
These variables have a low correlation:
All the other variable pairs have a low positive or low negative correlation. It is not surprising that Volatility shows up so many times in the low-correlation list; markets, in general, do not like volatility, as can be seen from the events of 2020 through 2022.
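The heatmap can be reduced to an explicit list of highly correlated pairs. This sketch uses a synthetic frame (columns `a`, `b`, `c` are made up) and an arbitrary 0.7 threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=500),  # strongly correlated with a
    "c": rng.normal(size=500),                     # roughly independent
})

corr = demo.corr()
# Keep only the upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.7]
print(high)
```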
%%time
tempdf = df[cols_list].copy()
sns.pairplot(tempdf, diag_kind='kde')
del tempdf
Wall time: 7.2 s
<seaborn.axisgrid.PairGrid at 0x27b775cb880>
This chart reinforces what was seen in the heatmap and other graphs. The variables are not completely independent of each other, and each KDE is either right skewed or approximately normal.
plt.figure(figsize=(15, 8))
p = sns.barplot(data=df, x="GICS Sector", y="Price Change", ci=None)
plt.xticks(rotation=90)
show_values(p)
Telecommunications Services
The sector with the least growth, and frankly negative growth, is:
Energy seems counterintuitive given the focus on renewable and green energy.
plt.figure(figsize=(15, 8))
p = sns.barplot(data=df, x="GICS Sector", y="Cash Ratio", ci=None)
plt.xticks(rotation=90)
show_values(p)
The sectors with the highest average cash ratios across the sectors are:
These sectors will have the greatest chance of meeting their financial obligations, like payroll, from existing cash and cash equivalents without having to issue a debt instrument like commercial paper.
The sectors with the lowest average cash ratios across the sectors are:
These sectors will have the lowest chance of meeting their financial obligations, like payroll, from existing cash and cash equivalents without having to issue a debt instrument like commercial paper.
plt.figure(figsize=(15, 8))
p = sns.barplot(data=df, x="GICS Sector", y="Net Cash Flow", ci=None)
plt.xticks(rotation=90)
show_values(p)
As for actual cash flow across each sector, these sectors have the lowest net cash flow:
The Real Estate sector is just breaking even. <br> All other sectors have a positive cash flow.
plt.figure(figsize=(15, 8))
p = sns.barplot(data=df, x="GICS Sector", y="P/E Ratio", ci=None)
plt.xticks(rotation=90)
show_values(p)
The price to earnings ratios are the highest in these sectors:
A high P/E means investors are paying more per dollar of current earnings, which generally reflects expectations of stronger future growth.
The price to earnings ratios are the lowest in these sectors:
A low P/E means a stock is priced cheaply relative to its current earnings, which can signal either good value or weak growth expectations.
plt.figure(figsize=(15, 8))
p = sns.barplot(data=df, x="GICS Sector", y="Volatility", ci=None)
plt.xticks(rotation=90)
show_values(p)
======================================================================================================================
%%time
plt.figure(figsize=(15, 12))
for i, variable in enumerate(cols_list):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Wall time: 1.08 s
The outliers will not be treated, as treating them will skew the data to the point that it is no longer valuable.
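The outlier counts behind these boxplots can be quantified with the same 1.5*IQR whisker rule used above (`whis=1.5`). A minimal sketch on synthetic data with two planted extremes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Mostly normal data with two planted extreme values
values = pd.Series(np.concatenate([rng.normal(0, 1, 200), [15.0, -12.0]]))

# Standard 1.5*IQR whisker rule, matching whis=1.5 in the boxplots above
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(len(outliers))
```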
=====================================================================================================================
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[cols_list].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Current Price                 340 non-null    float64
 1   Price Change                  340 non-null    float64
 2   Volatility                    340 non-null    float64
 3   ROE                           340 non-null    float64
 4   Cash Ratio                    340 non-null    float64
 5   Net Cash Flow                 340 non-null    float64
 6   Net Income                    340 non-null    float64
 7   Earnings Per Share            340 non-null    float64
 8   Estimated Shares Outstanding  340 non-null    float64
 9   P/E Ratio                     340 non-null    float64
 10  P/B Ratio                     340 non-null    float64
dtypes: float64(11)
memory usage: 29.3 KB
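As a quick check that StandardScaler behaved as expected, here is a sketch on synthetic data verifying the z-score property (mean of about 0 and standard deviation of about 1 per column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Stand-in for the numeric subset (the real frame is the 340-row subset above)
X = rng.normal(loc=50, scale=10, size=(340, 3))

scaled = StandardScaler().fit_transform(X)

# After z-scoring, every column should have mean ~0 and std ~1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```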
%%time
sns.pairplot(subset_scaled_df, diag_kind='kde')
Wall time: 9.58 s
<seaborn.axisgrid.PairGrid at 0x2311f6d4f70>
========================================================================================================================
# Making a copy of the scaled data set
k_means_df = subset_scaled_df.copy()
%%time
clusters = range(1, 15)
meanDistortions = []
i = 1  # for printing the difference in distortion
for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )
    meanDistortions.append(distortion)
    if i == 1:  # first pass: no previous distortion to compare against
        print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
    else:  # print distortion and difference from previous distortion
        print(
            "Number of Clusters:", k,
            "\tAverage Distortion:", distortion,
            "\t difference:", pastdist - distortion,
        )
    pastdist = distortion
    i += 1

print("\n")
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466 	 difference: 0.16018849302770377
Number of Clusters: 3 	Average Distortion: 2.2692367155390745 	 difference: 0.11308178335539143
Number of Clusters: 4 	Average Distortion: 2.1745559827866363 	 difference: 0.09468073275243816
Number of Clusters: 5 	Average Distortion: 2.128799332840716 	 difference: 0.0457566499459201
Number of Clusters: 6 	Average Distortion: 2.080400099226289 	 difference: 0.04839923361442722
Number of Clusters: 7 	Average Distortion: 2.0289794220177395 	 difference: 0.05142067720854948
Number of Clusters: 8 	Average Distortion: 1.964144163389972 	 difference: 0.0648352586277674
Number of Clusters: 9 	Average Distortion: 1.9221492045198068 	 difference: 0.041994958870165355
Number of Clusters: 10 	Average Distortion: 1.8513913649973124 	 difference: 0.07075783952249437
Number of Clusters: 11 	Average Distortion: 1.8024134734578485 	 difference: 0.04897789153946386
Number of Clusters: 12 	Average Distortion: 1.7900931879652673 	 difference: 0.01232028549258124
Number of Clusters: 13 	Average Distortion: 1.7417609203336912 	 difference: 0.048332267631576054
Number of Clusters: 14 	Average Distortion: 1.673559857259703 	 difference: 0.06820106307398821
Wall time: 3.65 s
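The per-step differences printed above can also be computed in one call with np.diff. This sketch reproduces the idea on synthetic blobs (make_blobs data, not the stock data):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

distortions = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # Average distance of each point to its nearest centroid
    distortions.append(cdist(X, km.cluster_centers_).min(axis=1).mean())

# Improvement per added cluster; the elbow is where the gains shrink sharply
gains = -np.diff(distortions)
print(np.round(distortions, 3))
print(np.round(gains, 3))
```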
%%time
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Wall time: 3.48 s
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
For the silhouette analysis, I am going to compute the scores over the same range of cluster counts, and also visualize the silhouette plot for 10 clusters.
* I know 10 clusters is incorrect, but I wanted to see what a bad example looks like
%%time
sil_score = []
cluster_list = range(2, 15)
i = 1  # for computing differences in silhouette scores
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    if i == 1:  # first pass: no previous score to compare against
        print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
    else:  # print silhouette score and difference from previous score
        print(
            "For n_clusters = {}, the silhouette score is {}".format(n_clusters, score),
            "\tdifference:", prevscore - score,
        )
    prevscore = score
    i += 1

plt.plot(cluster_list, sil_score)
plt.show()
print("\n")
For n_clusters = 2, the silhouette score is 0.43969639509980457
For n_clusters = 3, the silhouette score is 0.4644405674779404 	difference: -0.02474417237813581
For n_clusters = 4, the silhouette score is 0.4577225970476733 	difference: 0.006717970430267062
For n_clusters = 5, the silhouette score is 0.43228336443659804 	difference: 0.02543923261107528
For n_clusters = 6, the silhouette score is 0.4005422737213617 	difference: 0.03174109071523634
For n_clusters = 7, the silhouette score is 0.3976335364987305 	difference: 0.002908737222631186
For n_clusters = 8, the silhouette score is 0.40278401969450467 	difference: -0.005150483195774158
For n_clusters = 9, the silhouette score is 0.3778585981433699 	difference: 0.024925421551134752
For n_clusters = 10, the silhouette score is 0.13458938329968687 	difference: 0.24326921484368305
For n_clusters = 11, the silhouette score is 0.1421832155528444 	difference: -0.007593832253157545
For n_clusters = 12, the silhouette score is 0.2044669621527429 	difference: -0.06228374659989849
For n_clusters = 13, the silhouette score is 0.23424874810104204 	difference: -0.02978178594829914
For n_clusters = 14, the silhouette score is 0.12102526472829901 	difference: 0.11322348337274303
Wall time: 3.21 s
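The idea of picking the k with the highest silhouette score can be demonstrated on synthetic blobs where the true number of clusters is known (three hand-placed, well-separated centers; not the stock data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated synthetic blobs (illustrative data only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The silhouette score should peak at the true number of blobs
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```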
%%time
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Wall time: 3.3 s
<AxesSubplot:title={'center':'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(2, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(10, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 10 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# final K-means model
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df)
KMeans(n_clusters=4, random_state=1)
label = kmeans.fit_predict(k_means_df)
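The per-cluster frames `label_0`–`label_3` used in the scatter plots below are not defined in this excerpt; they were most likely built by boolean masking on the fitted labels. A self-contained sketch of that step, with a small synthetic frame standing in for `k_means_df` (its size and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
k_means_df = pd.DataFrame(
    rng.normal(size=(40, 4)),
    columns=["Current Price", "Price Change", "Volatility", "ROE"],
)

label = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(k_means_df)

# one DataFrame per cluster, selected by a boolean mask on the labels
label_0 = k_means_df[label == 0]
label_1 = k_means_df[label == 1]
label_2 = k_means_df[label == 2]
label_3 = k_means_df[label == 3]

# every row lands in exactly one cluster
assert sum(len(d) for d in (label_0, label_1, label_2, label_3)) == len(k_means_df)
```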
cols_list = k_means_df.columns  # feature names used for the pairwise cluster plots
print("\n")
print("Comparison of", cols_list[0], "and", cols_list[1])
plt.scatter(label_0[cols_list[0]], label_0[cols_list[1]], color="red")
plt.scatter(label_1[cols_list[0]], label_1[cols_list[1]], color="black")
plt.scatter(label_2[cols_list[0]], label_2[cols_list[1]], color="green")
plt.scatter(label_3[cols_list[0]], label_3[cols_list[1]], color="blue")
plt.show()
Comparison of Current Price and Price Change
print("\n")
print("Comparison of", cols_list[1], "and", cols_list[2])
plt.scatter(label_0[cols_list[1]], label_0[cols_list[2]], color="red")
plt.scatter(label_1[cols_list[1]], label_1[cols_list[2]], color="black")
plt.scatter(label_2[cols_list[1]], label_2[cols_list[2]], color="green")
plt.scatter(label_3[cols_list[1]], label_3[cols_list[2]], color="blue")
plt.show()
Comparison of Price Change and Volatility
print("\n")
print("Comparison of", cols_list[2], "and", cols_list[3])
plt.scatter(label_0[cols_list[2]], label_0[cols_list[3]], color="red")
plt.scatter(label_1[cols_list[2]], label_1[cols_list[3]], color="black")
plt.scatter(label_2[cols_list[2]], label_2[cols_list[3]], color="green")
plt.scatter(label_3[cols_list[2]], label_3[cols_list[3]], color="blue")
plt.show()
Comparison of Volatility and ROE
# creating a copy of the original data
df1 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
df1["KM_segments"].value_counts()
0    277
2     27
3     25
1     11
Name: KM_segments, dtype: int64
Based on the elbow and silhouette analyses above, I will proceed with n_clusters = 4.
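For the K-means segments, a numeric profile analogous to the one built later for the hierarchical clusters can be produced with a groupby. A sketch on synthetic data (the column values and segment assignments are made up; only the pattern matters):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df1 = pd.DataFrame({
    "Current Price": rng.uniform(10, 100, 20),
    "ROE": rng.uniform(1, 40, 20),
    "KM_segments": rng.integers(0, 4, 20),
})

# mean of every numeric column per cluster, plus the cluster sizes
km_profile = df1.groupby("KM_segments").mean()
km_profile["count_in_each_segment"] = df1.groupby("KM_segments").size()
```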
====================================================================================================================
hc_df = subset_scaled_df.copy()
%%time
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
print('\n')
i = 1 # For printout formatting
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        i += 1
        if i == 5:  # For printout formatting
            print('\n')
            i = 1
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('\n')
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
print('\n') # Formatting for wall time
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.

Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.

Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524588.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.792530720285.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180427.

Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.

****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
Wall time: 121 ms
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
print('\n')
for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('\n')
print('*'*100)
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
print('\n') # Formatting for Wall Time
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.

****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
Wall time: 40 ms
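The cophenetic correlation reported above is the Pearson correlation between the original pairwise distances and the distances implied by the dendrogram; values near 1 mean the tree preserves the data's geometry well. A minimal, self-contained sketch of the computation, with random data standing in for `subset_scaled_df`:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # stand-in for the scaled feature matrix

# build the hierarchical tree, then correlate tree distances with raw distances
Z = linkage(X, metric="euclidean", method="average")
c, coph_dists = cophenet(Z, pdist(X))
# c is a single correlation coefficient; coph_dists is the condensed
# matrix of pairwise cophenetic distances
```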
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
Wall time: 2.77 s
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient", ascending=False)
df_cc
|   | Linkage | Cophenetic Coefficient |
|---|---|---|
| 2 | average | 0.942 |
| 3 | centroid | 0.931 |
| 0 | single | 0.923 |
| 5 | weighted | 0.869 |
| 1 | complete | 0.787 |
| 4 | ward | 0.710 |
The dendrograms confirm that Euclidean distance with average linkage is the best combination, with a cophenetic correlation of 0.94.
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(linkage='average', n_clusters=6)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
cluster_profile = df2.groupby("HC_segments").mean()
cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_segments")["Security"].count().values
)
cluster_profile.style.highlight_max(color="yellow", axis=0)
| HC_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.287589 | 4.099730 | 1.518066 | 35.336336 | 66.900901 | -33197321.321321 | 1538074666.666667 | 2.885270 | 560505037.293544 | 32.441706 | -2.174921 | 333 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 5 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 90885000.000000 | 596541000.000000 | 8.910000 | 66951851.850000 | 31.040405 | 129.064585 | 1 |
The maximum value in each column should be highlighted in yellow; `style` output is HTML-based, so the highlighting does not render in this static export.
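Since the `style.highlight_max` colouring is lost outside a live notebook, the same information can be recovered numerically with `idxmax`, which returns the segment holding each column's maximum. A sketch on a cut-down, made-up profile table:

```python
import pandas as pd

# tiny stand-in for cluster_profile, with invented values
cluster_profile = pd.DataFrame(
    {"ROE": [35.3, 12.5, 802.0], "Cash Ratio": [66.9, 130.5, 51.0]},
    index=pd.Index([0, 1, 2], name="HC_segments"),
)

# segment holding the maximum of each column (what highlight_max marks in yellow)
max_segments = cluster_profile.idxmax()
print(max_segments)
# here ROE is maximal in segment 2 and Cash Ratio in segment 1
```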
for cl in df2["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["HC_segments"] == cl]["Security"].unique())
print()
In cluster 0, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Alexion Pharmaceuticals' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A' 'Amazon.com Inc' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Anadarko Petroleum Corp' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'Baker Hughes Inc' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Citigroup Inc.' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'Cabot Oil & Gas' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Concho Resources' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 
'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'Devon Energy Corp.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'EOG Resources' 'Equinix' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Ford Motor' 'Fastenal Co' 'Fortune Brands Home & Security' 'Freeport-McMoran Cp & Gld' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'Frontier Communications' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hess Corporation' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'JPMorgan Chase & Co.' 'Kimco Realty' 'Kimberly-Clark' 'Kinder Morgan' 'Coca Cola Company' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 
'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'Marathon Oil Corp.' 'M&T Bank Corp.' 'Mettler Toledo' 'Murphy Oil' 'Mylan N.V.' 'Navient' 'Noble Energy Inc' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.' 'Newfield Exploration Co' 'Nielsen Holdings' 'National Oilwell Varco Inc.' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'ONEOK' 'Omnicom Group' "O'Reilly Automotive" 'Occidental Petroleum' "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Pfizer Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International' 'Roper Industries' 'Range Resources Corp.' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Southwestern Energy' 'Synchrony Financial' 'Stryker Corp.' 'AT&T Inc' 'Molson Coors Brewing Company' 'Teradata Corp.' 'Tegna, Inc.' 'Torchmark Corp.' 
'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Verizon Communications' 'Waters Corporation' 'Wec Energy Group Inc' 'Wells Fargo' 'Whirlpool Corp.' 'Waste Management Inc.' 'Williams Cos.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Cimarex Energy' 'Xcel Energy Inc' 'XL Capital' 'Exxon Mobil Corp.' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 5, the following companies are present: ['Alliance Data Systems'] In cluster 2, the following companies are present: ['Apache Corporation' 'Chesapeake Energy'] In cluster 1, the following companies are present: ['Bank of America Corp' 'Intel Corp.'] In cluster 3, the following companies are present: ['Facebook'] In cluster 4, the following companies are present: ['Priceline.com Inc']
df2.groupby(["HC_segments", "GICS Sector"])["Security"].count()
HC_segments GICS Sector
0 Consumer Discretionary 39
Consumer Staples 19
Energy 28
Financials 48
Health Care 40
Industrials 53
Information Technology 30
Materials 20
Real Estate 27
Telecommunications Services 5
Utilities 24
1 Financials 1
Information Technology 1
2 Energy 2
3 Information Technology 1
4 Consumer Discretionary 1
5 Information Technology 1
Name: Security, dtype: int64
%%time
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(cols_list):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
Wall time: 1.16 s
df2.groupby("HC_segments").mean().plot.bar(figsize=(15, 6))
<AxesSubplot:xlabel='HC_segments'>
Cluster 0: the catch-all segment, with 333 companies spanning every GICS sector and moderate values on most indicators.
Cluster 1: Bank of America and Intel, which stand out for the highest net cash flow and net income.
Cluster 2: Apache Corporation and Chesapeake Energy, energy companies with deeply negative net income, the highest volatility, and an extreme ROE.
Cluster 3: Facebook alone, with the highest cash ratio and P/E ratio.
Cluster 4: Priceline.com alone, with by far the highest current price and earnings per share.
Cluster 5: Alliance Data Systems alone, with an extreme P/B ratio.
Comparing the two techniques: K-means with k = 4 produced segments of 277, 27, 25, and 11 companies, while hierarchical clustering with average linkage and 6 clusters placed 333 companies in a single segment and isolated a handful of outliers across the remaining five. K-means therefore yields the more balanced segmentation on this data, while the hierarchical model is mainly useful for flagging outlier stocks.